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II Abstract 

Principal component analysis (PCA) is widely used for dimensionality reduction, with well-documented 
merits in various applications involving high-dimensional data, including computer vision, preference 
measurement, and bioinformatics. In this context, the fresh look advocated here permeates benefits from 
variable selection and compressive sampling, to robustify PCA against outliers. A least-trimmed squares 
estimator of a low-rank bilinear factor analysis model is shown closely related to that obtained from an Iq- 
(pseudo)norm-regularized criterion encouraging sparsity in a matrix explicitly modeling the outliers. This 
connection suggests robust PCA schemes based on convex relaxation, which lead naturally to a family of 
robust estimators encompassing Huber's optimal M-class as a special case. Outliers are identified by tuning 
a regularization parameter, which amounts to controlling sparsity of the outlier matrix along the whole 
robustification path of (group) least-absolute shrinkage and selection operator (Lasso) solutions. Beyond 
its neat ties to robust statistics, the developed outlier-aware PCA framework is versatile to accommodate 
novel and scalable algorithms to: i) track the low-rank signal subspace robustly, as new data are acquired 
in real time; and ii) determine principal components robustly in (possibly) infinite-dimensional feature 
spaces. Synthetic and real data tests corroborate the effectiveness of the proposed robust PCA schemes, 
when used to identify aberrant responses in personality assessment surveys, as well as unveil communities 
in social networks, and intruders from video surveillance data. 
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I. Introduction 

Principal component analysis (PCA) is the workhorse of high-dimensional data analysis and dimen- 
sionality reduction, with numerous applications in statistics, engineering, and the biobehavioral sciences; 
see, e.g., |[22ll . Nowadays ubiquitous e-commerce sites, the Web, and urban traffic surveillance systems 
generate massive volumes of data. As a result, the problem of extracting the most informative, yet low- 
dimensional structure from high-dimensional datasets is of paramount importance lITTl . To this end, PCA 
provides least-squares (LS) optimal linear approximants in M'^ to a data set in M^, for q < p. The desired 
linear subspace is obtained from the q dominant eigenvectors of the sample data covariance matrix ll22l . 

Data obeying postulated low-rank models include also outliers, which are samples not adhering to those 
nominal models. Unfortunately, LS is known to be very sensitive to outliers |fT9l . |[32l . and this undesirable 
property is inherited by PCA as well ll22l . Early efforts to robustify PCA have relied on robust estimates of 
the data covariance matrix; see, e.g., ||4l. Related approaches are driven from statistical physics |[39l . and 
also from M-estimators IS]. Recently, polynomial-time algorithms with remarkable performance guarantees 
have emerged for low-rank matrix recovery in the presence of sparse - but otherwise arbitrarily large - 
errors [51, Q. This pertains to an 'idealized robust' PCA setup, since those entries not affected by 
outliers are assumed error free. Stability in reconstructing the low-rank and sparse matrix components 
in the presence of 'dense' noise have been reported in ll38l . ll42l . A hierarchical Bayesian model was 
proposed to tackle the aforementioned low-rank plus sparse matrix decomposition problem in ||9l- 

In the present paper, a robust PCA approach is pursued requiring minimal assumptions on the outlier 
model. A natural least-trimmed squares (LTS) PCA estimator is first shown closely related to an estimator 
obtained from an ^o-(pseudo)norm-regularized criterion, adopted to fit a low-rank bilinear factor analysis 
model that explicitly incorporates an unknown sparse vector of outliers per datum (Section As in 
compressive sampling |[35l . efficient (approximate) solvers are obtained in Section JIIJ by surrogating the 
^o-norm of the outlier matrix with its closest convex approximant. This leads naturally to an M-type PCA 
estimator, which subsumes Ruber's optimal choice as a special case |[T3l . Unlike Ruber's formulation 
though, results here are not confined to an outlier contamination model. A tunable parameter controls the 
sparsity of the estimated matrix, and the number of outliers as a byproduct. Hence, effective data-driven 
methods to select this parameter are of paramount importance, and systematic approaches are pursued 
by efficiently exploring the entire robustifaction (a.k.a. homotopy) path of (group-) Lasso solutions ifTTl . 
Bn . In this sense, the method here capitalizes on but is not limited to sparse settings where outliers 
are sporadic, since one can examine all sparsity levels along the robustification path. The outlier-aware 
generative data model and its sparsity-controlling estimator are quite general, since minor modifications 
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discussed in Section UlI-CI enable robustifiying linear regression |[T4l . dictionary learning ll24l . 041 . and 
K-means clustering as well ifTlll . lITTl . Section |IV] deals with further modifications for bias reduction 
through nonconvex regularization, and automatic determination of the reduced dimension q. 

Beyond its neat ties to robust statistics, the developed outlier-aware PCA framework is versatile to 
accommodate scalable robust algorithms to: i) track the low-rank signal subspace, as new data are acquired 
in real time (Section IVjl; and ii) determine principal components in (possibly) infinite-dimensional feature 
spaces, thus robustifying kernel PCA |[33l . and spectral clustering as well ifTTl p. 544] (Section rvD ). 
The vast literature on non-robust subspace tracking algorithms includes Il24l . POl . and 121; see also |[T8l 
for a first-order algorithm that is robust to outliers and incomplete data. Relative to lITSl . the online 
robust (OR-) PCA algorithm of this paper is a second-order method, which minimizes an outlier-aware 
exponentially-weighted LS estimator of the low-rank factor analysis model. Since the outlier and subspace 
estimation tasks decouple nicely in OR-PCA, one can readily devise a first-order counterpart when minimal 
computational loads are at a premium. In terms of performance, online algorithms are known to be 
markedly faster than their batch alternatives O, |[T8l . e.g., in the timely context of low-rank matrix 
completion ||29l . ll30l . While the focus here is not on incomplete data records, extensions to account for 
missing data are immediate and will be reported elsewhere. 

In Section |VII[ numerical tests with synthetic and real data corroborate the effectiveness of the proposed 
robust PCA schemes, when used to identify aberrant responses from a questionnaire designed to measure 
the Big-Five dimensions of personality traits |[2T1 . as well as unveil communities in a (social) network of 
college football teams ||T5l . and intruders from video surveillance data |[8l. Concluding remarks are given 
in Section I Villi while a few technical details are deferred to the Appendix. 

Notation: Bold uppercase (lowercase) letters will denote matrices (column vectors). Operators (•)', tr(-), 
med(-), and will denote transposition, matrix trace, median, and Hadamard product, respectively. Vector 
diag(M) collects the diagonal entries of M, whereas the diagonal matrix diag(v) has the entries of v on 
its diagonal. The £p-norm of x G M" is ||x||p := {Y.7=i \xi\Pf^^ for p > 1; and \\M\\f := 0r (MM') is 
the matrix Frobenious norm. The n x n identity matrix will be represented by I„, while 0„ will denote 
the n X 1 vector of all zeros, and 0„xm := 0„0^. Similar notation will be adopted for vectors (matrices) 
of all ones. The i-th vector of the canonical basis in M" will be denoted by b„ j, i = 1, . . . ,n. 

II. Robustifying PCA 

Consider the standard PCA formulation, in which a set of data Ty := {yn}^=i in the p-dimensional 
Euclidean input space is given, and the goal is to find the best g-rank {q < p) linear approximation to the 
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data in 7^; see e.g., ll22l . Unless otherwise stated, it is assumed throughout that the value of q is given. 
One approach to solving this problem, is to adopt a low-rank bilinear (factor analysis) model 

y„ = m + Us.„ + e„, n = l,...,N (1) 

where m G is a location (mean) vector; matrix U G W^'^ has orthonormal columns spanning the 
signal subspace; {sn}n=i are the so-termed principal components, and {en}n=i are zero-mean i.i.d. 
random errors. The unknown variables in ([T) can be collected in V := {m, U, {s„}^^]^}, and they are 
estimated using the LS criterion as 

N 

min ||y„ — m — Us„||2, s. to U'U = 1^. (2) 

^ n=l 

PC A in Q is a nonconvex optimization problem due to the bilinear terms Us„, yet a global optimum V can 
be shown to exist; see e.g., HOl . The resulting estimates are rh = Yl,n=i Yn/N and s„ = U'(y„ — m), n = 
1, . . . , N; while U is formed with columns equal to the g-dominant right singular vectors of the N x p 
data matrix Y := [yi, . . . , yA^]' HZl P- 535]. The principal components (entries of) s„ are the projections 
of the centered data points {y„ — m}^^i onto the signal subspace. Equivalently, PCA can be formulated 
based on maximum variance, or, minimum reconstruction error criteria; see e.g., ll22l . 

A. Least-trimmed squares PCA 

Given training data Tx ■= {:x.n}n=i possibly contaminated with outliers, the goal here is to develop 
a robust estimator of V that requires minimal assumptions on the outlier model. Note that there is an 
explicit notational differentiation between: i) the data in Ty which adhere to the nominal model ([T]); and 
ii) the given data in Tx that may also contain outliers, i.e., those x„ not adhering to ([T]). Building on LTS 
regression |[32| , the desired robust estimate Vlts '■= {^Ij U, {s^j^^j^} for a prescribed v < N can be 
obtained via the following LTS PCA estimator [cf. ([2])] 

Vlt5 :=arg mm J^rf„](V), s. to U'U = I, (3) 

n=l 

where ?'p„](V) is the n-th order statistic among the squared residual norms rf (V), . . . , r|^(V), and r„(V) := 
||x„ — m — Us„||2. The so-termed coverage v determines the breakdown point of the LTS PCA estima- 
tor ||32l . since the N — v largest residuals are absent from the estimation criterion in ([3]l. Beyond this 
universal outlier-rejection property, the LTS -based estimation offers an attractive alternative to robust linear 
regression due to its high breakdown point and desirable analytical properties, namely \/]V-consistency 
and asymptotic normality under mild assumptions |[32l . 
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Remark 1 (Robust estimation of the mean): In most applications of PC A, data in Ty are typically 
assumed zero mean. This is without loss of generality, since nonzero-mean training data can always 
be rendered zero mean, by subtracting the sample mean Yl,n=i'yn/N from each y„. In modeling zero- 
mean data, the known vector m in ([T]) can obviously be neglected. When outliers are present however, data 
in Tx are not necessarily zero mean, and it is unwise to center them using the non-robust sample mean 
estimator which has a breakdown point equal to zero |[32l . Towards robustifying PC A, a more sensible 
approach is to estimate m robustly, and jointly with U and the principal components {sn}n=i- 

Because Q is a nonconvex optimization problem, a nontrivial issue pertains to the existence of the 
proposed LTS PCA estimator, i.e., whether or not Q attains a minimum. Fortunately, the answer is in 
the affirmative as asserted next. 

Property 1: Tlie LTS PCA estimator is well defined, since ^ has (at least) one solution. 
Existence of Vlts can be readily established as follows: i) for each subset of T with cardinality v (there 
are (^) such subsets), solve the corresponding PCA problem to obtain a unique candidate estimator per 
subset; and ii) pick Vlts as the one among all ('^) candidates with the minimum cost. 

Albeit conceptually simple, the solution procedure outlined under Property [T]is combinatorially complex, 
and thus intractable except for small sample sizes N. Algorithms to obtain approximate LTS solutions in 
large-scale linear regression problems are available; see e.g., ||32| . 

B. lo-norm regularization for robustness 

Instead of discarding large residuals, the alternative approach here explicitly accounts for outliers in the 
low-rank data model ([Hi. This becomes possible through the vector variables {o„}^^^ one per training 
datum x„, which take the value o„ ^ Op whenever datum n is an outlier, and o„ = Op otherwise. Thus, 
the novel outlier-aware factor analysis model is 

x„ = y„ + o„ = m + Us„ + e„ + o„, n = l,...,N (4) 

where o„ can be deterministic or random with unspecified distribution. In the under-determined linear 
system of equations both V as well as the N x p matrix O := [oi, . . . , oat]' are unknown. The 
percentage of outliers dictates the degree of sparsity (number of zero rows) in O. Sparsity control will 
prove instrumental in efficiently estimating O, rejecting outliers as a byproduct, and consequently arriving 
at a robust estimator of V. To this end, a natural criterion for controlling outlier sparsity is to seek the 
estimator [cf. Q] 

{V,6} = argmin||X-ljvm'-SU'-0||| + Ao||0||o, s. to U'U = Ig (5) 
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where X := [xi, . . . ,XAr]' e R^'^p, S := [si, . . . ,SAr]' G M^^^^ and ||0||o denotes the nonconvex 4- 
norm that is equal to the number of nonzero rows of O. Vector (group) sparsity in the rows 6„ of O can 
be directly controlled by tuning the parameter Aq > 0. 

As with compressive sampling and sparse modeling schemes that rely on the £o-norm |[35ll . the robust 
PCA problem ^ is NP-hard |[26l . In addition, the sparsity-controUing estimator ^ is intimately related 
to LTS PCA, as asserted next. 

Proposition 1: If {V,0} minimizes (HJ with Xq chosen such that ||0||o = — i^, then Vlts = V. 

Proof: Given Aq such that ||0||o = — z^, the goal is to characterize V as well as the positions and 
values of the nonzero rows of O. Note that because ||0||o = A^ — i^, the last term in the cost of ^ is 
constant, hence inconsequential to the minimization. Upon defining r„ := x„ — rh — Us^i, it is not hard 
to see from the optrmality conditions that the rows of O satisfy 

{Op, llrnib < ^ 
, n=l,...,N. (6) 
rn, ||r„||2 > 

This is intuitive, since for those nonzero 6„ the best thing to do in terms of minimizing the overall cost 
is to set On = r-n, and thus null the corresponding squared-residual terms in In conclusion, for the 
chosen value of Aq it holds that N — u squared residuals effectively do not contribute to the cost in ([5]). 

To determine V and the row support of 6, one alternative is to exhaustively test all (^^^) = (^) admis- 
sible row-support combinations. For each one of these combinations (indexed by j), let Sj C {1, . . . , A^} 
be the index set describing the row support of O^-'^ i.e., 6^^^ ^ Op if and only if n G Sj; and \Sj\ = N — u. 
By virtue of Q, the corresponding candidate V^-'^ solves miiiy Yln&s ^ni'^) subject to U'U = Ig, while 
V is the one among all {V^-')} that yields the least cost. Recognizing the aforementioned solution procedure 
as the one for LTS PCA outlined under Property [1] it follows that Vlts = V- ■ 

The importance of Proposition [T] is threefold. First, it formally justifies model (|4]) and its estimator 
(|5]) for robust PCA, in light of the well documented merits of LTS ||32l . Second, it further solidifies the 
connection between sparsity-aware learning and robust estimation. Third, problem Q lends itself naturally 
to efficient (approximate) solvers based on convex relaxation, the subject dealt with next. 

III. Sparsity-Controlling Outlier Rejection 

Recall that the row-wise ^2-norm sum ||B||2,r := ^n=i l|t>n||2 of matrix B := [bi, . . . ,b7v]' G M^^^ 
is the closest convex approximation of ||B||o. This property motivates relaxing problem (|5]l to 

min||X-l7vm'-SU'-0||| + A2||0||2r, s. to U'U = I„. (7) 
v,o 
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The nondifferentiable ^2 -norm regularization term encourages row-wise (vector) sparsity on the estimator 
of O, a property that has been exploited in diverse problems in engineering, statistics, and machine 
learning lITTl . A noteworthy representative is the group Lasso 114111 . a popular tool for joint estimation and 
selection of grouped variables in linear regression. 

It is pertinent to ponder on whether problem ^ still has the potential of providing robust estimates V 
in the presence of outliers. The answer is positive, since it is shown in the Appendix that (|7]) is equivalent 
to an M-type estimator 

N 

min ^ Pvi^n — m — Us„), s. to U'U = Ig (8) 

^ n=l 

where : — >• M is a vector extension to Ruber's convex loss function jTOl ; see also |[23l . and 

. . / l|r|li, ||r||2 < A2/2 

Pv{r) := < . (9) 

[ A2||r||2-A2/4, ||r||2>A2/2 

M-type estimators (including Ruber's) adopt a fortiori an e-contaminated probability distribution for the 
outliers, and rely on minimizing the asymptotic variance of the resultant estimator for the least favorable 
distribution of the e-contaminated class (asymptotic min-max approach) (19]. The assumed degree of 
contamination specifies the tuning parameter A2 in ^ (and thus the threshold for deciding the outliers 
in M-estimators). In contrast, the present approach is universal in the sense that it is not confined to any 
assumed class of outlier distributions, and can afford a data-driven selection of the tuning parameter. In a 
nutshell, M-estimators can be viewed as a special case of the present formulation only for a specific choice 
of A2, which is not obtained via a data-driven approach, but from distributional assumptions instead. 

All in all, the sparsity-controlling role of the tuning parameter A2 > in Q is central, since model 
([Hi and the equivalence of Q with ([8]l suggest that A2 is a robustness-controlling constant. Data-driven 
approaches to select A2 are described in detail under Section ITlI-B I Before dwelling into algorithmic issues 
to solve Q, a couple of remarks are in order. 

Remark 2 (^i-norm regularization for entry-wise outliers): In computer vision applications where ro- 
bust PCA schemes are particularly attractive, one may not wish to discard the entire (vectorized) images 
x„, but only specific pixels deemed as outliers lU. This can be accomplished by replacing ||0||2,r in © 
with II O 111 := "^^-i ||on||i, a Lasso-type regularization that encourages entry-wise sparsity in 6. 
Remark 3 (Outlier rejection): From the equivalence between problems ([7]) and ([8]), it follows that those 
data points x„ deemed as containing outliers (6„ / Op) are not completely discarded from the estimation 
process. Instead, their effect is downweighted as per Ruber's loss function [cf. ©J. Nevertheless, explicitly 
accounting for the outliers in O provides the means of identifying and removing the contaminated data 
altogether, and thus possibly re -running PCA on the outlier-free data. 
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A. Solving the relaxed problem 

To optimize ([7]) iteratively for a given value of A2, an alternating minimization (AM) algorithm is adopted 
which cyclically updates m(A;) — S(/c) — U(A;) — 0{k) per iteration A; = 1,2, . . .. AM algorithms are 
also known as block-coordinate-descent methods in the optimization parlance; see e.g., [31, l[36l . To update 
each of the variable groups, ([T]) is minimized while fixing the rest of the variables to their most up-to-date 
values. While the overall problem (|7]) is not jointly convex with respect to (w.r.t.) {S, U, O, m}, fixing all 
but one of the variable groups yields subproblems that are efficiently solved, and attain a unique solution. 

Towards deriving the updates at iteration k and arriving at the desired algorithm, note first that the 
mean update is m(A;) = (X — 0(A;))'lAr/iV. Next, form the centered and outlier-compensated data matrix 
Xo(/i;) :=X — lArm(A:)' — 0(/c — 1). The principal components are readily given by 

S(A;) = argmm ||Xo(A:) - SU(A: - = Xo(A;)U(A; - 1). 

Continuing the cycle, U(A;) solves 

min ||Xo(/c) - S(A;)U'|||, s. to U'U = Ig 

a constrained LS problem also known as reduced-rank Procrustes rotation P3l . The minimizer is given 
in analytical form in terms of the left and right singular vectors of 'K'^{k)S{k) 1431 Thm. 4]. In detail, 
one computes the SVD of l^^{k)S{k) = L(/fc)D(/fc)R'(/fc) and updates \J{k) = L(A:)R'(A;). Next, the 
minimization of (|7]) w.r.t. O is an orthonormal group Lasso problem. As such, it decouples across rows 
On giving rise to N ^2-norm regularized subproblems, namely 

On{k) = argmin ||r„,(fc) - o||2 + A2II0II2, n = 1, . . . , 

o 

where r„(/c) := x„ — m.{k) — U(/i;)s„(fc). The respective solutions are given by (see e.g., |[27l ) 

o„(.) = ^*W|i^, n=^,...,N (10, 

l|rn(A;)||2 

where (•)_!_ := max(-,0). For notational convenience, these N parallel vector soft-thresholded updates 
are denoted as 0{k) = 5 [X - lArm'(A: - 1) - S(fc)U'(A;), (A2/2)lAr] under Algorithm [U where the 
thresholding operator S sets the entire outlier vector On{k) to zero whenever ||r„(fc)||2 does not exceed 
A2/2, in par with the group sparsifying property of group Lasso. Interestingly, this is the same rule used 
to decide if datum x„ is deemed an outlier, in the equivalent formulation ([8]l which involves Ruber's loss 
function. Whenever an ^i-norm regularizer is adopted as discussed in Remark |2l the only difference is 
that updates ([TOl l boil down to soft-thresholding the scalar entries of r„(A;). 
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Algorithm 1 : Batch robust PC A solver 
Set U(0) = 1 : q) and 0(0) = Ojvxp. 

for fc = 1,2, . . . do 

Update m(fc) = (X - 0(fc - 

Form Xo(fc) = X - lArm'(fc) - 0(fc - 1). 

Update S(fc) = Xo(fc)U(fc - 1). 

Obtain L(fc)D(fc)R(fc)' = svd[X;(fc)S(fc)] and update U(fc) = L(fc)R'(fc). 
Update 0(fc) = 5 [X - lArm'(fc) - S(fc)U'(fc), (A2/2)Ijv] • 
end for 



The entire AM solver is tabulated under Algorithm [T] indicating also the recommended initialization. 
Algorithm [T] is conceptually interesting, since it explicitly reveals the intertwining between the outlier 
identification process, and the PCA low-rank model fitting based on the outlier compensated data Xo(A;). 

The AM solver is also computationally efficient. Computing the N x q matrix S(A;) = Xo(A;)U(A; — 1) 
requires Npq operations per iteration, and equally costly is to obtain 'K'^{k)S{k) G RP^"?. The cost 
of computing the SVD of X.'^{k)S{k) is of order 0{pq^), while the rest of the operations including 
the row-wise soft-thresholdings to yield 0{k) are Unear in both N and p. In summary, the total cost of 
Algorithm [T] is roughly kma.yiO{Np+pq^), where /cmax is the number of iterations required for convergence 
(typically fcmax = 5 to 10 iterations suffice). Because g < p is typically small. Algorithm [T] is attractive 
computationally both under the classic setting where N > p, and p is not large; as well as in high- 
dimensional data settings where p ^ N, a situation typically arising e.g., in microarray data analysis. 

Because each of the optimization problems in the per-iteration cycles has a unique minimizer, and the 
nondifferentiable regularization only affects one of the variable groups (O), the general results of ||36l 
apply to establish convergence of Algorithm \T\ as follows. 

Proposition 2: As k ^ oo, the iterates generated by Algorithm\J}converge to a stationary point of (|7]). 

B. Selection of X2: robustification paths 

Selecting A2 controls the number of outliers rejected. But this choice is challenging because existing 
techniques such as cross-validation are not effective when outliers are present |[32l . To this end, systematic 
data-driven approaches were devised in [14], which e.g., require a rough estimate of the percentage of 
outliers, or, robust estimates &^ of the nominal noise variance that can be obtained using median absolute 
deviation (MAD) schemes [19]. These approaches can be adapted to the robust PCA setting considered 
here, and leverage the robustification paths of (group-)Lasso solutions [cf. which are defined as the 
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solution paths corresponding to ||6„||2, n = 1, . . . , A^, for all values of A2. As A2 decreases, more vectors 
On enter the model signifying that more of the training data are deemed to contain outliers. 

Consider then a grid of Gx values of A2 in the interval [Amin, Amax]> evenly spaced on a logarithmic 
scale. Typically, Amax is chosen as the minimum A2 value such that O / Oatxp, while Amin = eAmax 
with e = 10~^, say. Because Algorithm [T] converges quite fast, (O can be efficiently solved over the grid 
of Gx values for A2. In the order of hundreds of grid points can be easily handled by initializing each 
instance of Algorithm 1 (per value of A2) using warm starts (VJt . This means that multiple instances of 
Q are solved for a sequence of decreasing A2 values, and the initialization of Algorithm [T] per grid point 
corresponds to the solution obtained for the immediately preceding value of A2 in the grid. For sufficiently 
close values of A2, one expects that the respective solutions will also be close (the row support of O will 
most likely not change), and hence Algorithm 1 will converge after few iterations. 

Based on the Gx samples of the robustification paths and the prior knowledge available on the outlier 
model dU), a couple of alternatives are also possible for selecting the 'best' value of A2 in the grid. A 
comprehensive survey of options can be found in |[T4l . 

Number of outliers is known: By direct inspection of the robustification paths one can determine the range 
of values for A2, such that the number of nonzero rows in O equals the known number of outliers sought. 
Zooming-in to the interval of interest, and after discarding the identified outliers, ET-fold cross-vaUdation 
methods can be applied to determine the 'best' A2. 

Nominal noise covariance matrix is known: Given := .Bfene'J, one can proceed as follows. Consider 
the estimates Vg obtained using ^ after sampling the robustification path for each point {\2,g}^=i- Next, 
pre-whiten those residuals corresponding to training data not deemed as containing outliers; i.e., form 

— 1/2 

T^g '■= {rn,g = "^e (xn — — UgS„ g) : n s. to 6„ = 0}, and find the sample covariance matrices 
{Sf g}^^^. The winner A2 := A2,g' corresponds to the grid point minimizing an absolute variance deviation 
criterion, namely g* := argmirig |tr[Sf^g] — p\. 

C. Connections with robust linear regression, dictionary learning, and clustering 

Previous efforts towards robustifying linear regression have pointed out the equivalence between M- 
type estimators and £i-norm regularized regression |[T3l . and capitalized on this neat connection under a 
Bayesian framework ll20l . However, they have not recognized the link to LTS via convex relaxation of 
the ^o-norm in The treatment here goes beyond linear regression by considering the PCA framework, 
which entails a more challenging bilinear factor analysis model. Linear regression is subsumed as a special 
case, when matrix U is not necessarily tall but assumed known, while s„ = s, V n = 1, . . . , A^. 
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As an alternative to PCA, it is possible to device dimensionality reduction schemes when the data 
admit a sparse representation over a perhaps unknown basis. Such sparse representations comprise only a 
few elements (atoms) of the overcomplete basis (a.k.a. dictionary) to reconstruct the original data record. 
Thus, each datum is represented by a coefficient vector whose effective dimensionality (number of nonzero 
coefficients) is smaller than that of the original data vector. Recently, the dictionary learning paradigm 
offers techniques to design a dictionary over which the data assume a sparse representation; see e.g., ll34l 
for a tutorial treatment. Dictionary learning schemes are flexible, in the sense that they utilize training 
data to learn an appropriate overcomplete basis customized for the data at hand Il24l . ||34l . 

However, as in PCA the criteria adopted typically rely on a squared-error loss function as a measure of 
fit, which is known to be very sensitive to outliers |[T9l . ||32l . Interestingly, one can conceivably think of 
robustifying dictionary learning via minor modifications to the framework described so far. For instance, 
with the same matrix notation used in e.g., one seeks to minimize 

min IIX - SU' - 0\\l + Ai||S||i + AallOlbr- (11) 
v,o 

Different from the low-rank outlier-aware model adopted for PCA [cf. here the dictionary U G M^'^'^ 
is fat {q ^ p), with column vectors that are no longer orthogonal but still constrained to have unit £2- 
norm. (This constraint is left implicit in (fTTl ) for simplicity.) Moreover, one seeks a sparse vector s„ to 
represent each datum x„, in terms of a few atoms of the learnt dictionary U. This is why ([TTI ) includes 
an additional sparsity-promoting ^i-norm regularization on S, that is not present in Sparsity is thus 
present both in the representation coefficients S, as well as in the outliers O. 

Finally, it is shown here that a generative data model for K-means clustering lITTl can share striking 
similarities with the bilinear model ([1]). Consequently, the sparsity-controUing estimator ^ can be adapted 
to robustify the K-means clustering task too lITll . Consider for instance that the data in Tx come from 
q clusters, each of which is represented by a centroid Uj G W, i = l,...,q. Moreover, for each 
input vector x„, K-means introduces the unknown membership variables s„j G {0,1}, i = l,...,q, 
where Sni = 1 whenever x„ comes from cluster i, and s„j = otherwise. Typically, the membership 
variables are also constrained to satisfy J2n=i Sni > V i (no empty clusters), and Yli=i ^ni = ^ ^ n 
(single cluster membership). Upon defining U := [ui,...,Ug] G W^'^ and the membership vectors 
Sn := [sni, ■ ■ ■ , SnqY G , a pertinent model for hard K-means clustering assumes that input vectors 
can be expressed as x„ = Us„ + e„ + o„, where e„ and o„ are as in (011. Because the aforementioned 
constraints imply ||s„||o = ||s„||i = 1 V n, if x„ belongs to cluster i, then s„j = 1 and in the absence 
of outliers one effectively has x„ = Uj + e^. Based on this data model, a natural approach towards 
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robustifying K-means clustering solves lfT2ll 

N q 

min||X-SU'-0||^ + A2||0||2,r-, s. to G {0, 1}, J]] s„i > 0, ^ s„i = 1. (12) 

' n=l i=l 

Recall that in the robust PCA estimator the subspace matrix is required to be orthonormal and the 
principal components are unrestrained. In the clustering context however, the centroid columns of U are 
free optimization variables, whereas the cluster membership variables adhere to the constraints in (fT2l ). 
Suitable relaxations to tackle the NP-hard problem (fT2l ) have been investigated in Ill2l . 

IV. Further Algorithmic Issues 
A. Bias reduction through nonconvex regularization 

Instead of substituting ||0||o in dSjl by its closest convex approximation, namely ||0||2,r> letting the 
surrogate function to be nonconvex can yield tighter approximations, and improve the statistical properties 
of the estimator. In rank minimization problems for instance, the logarithm of the determinant of the 
unknown matrix has been proposed as a smooth surrogate to the rank ifTTI : an alternative to the convex 
nuclear norm in e.g., ||29l . Nonconvex penalties such as the smoothly clipped absolute deviation (SCAD) 
have been also adopted to reduce bias iflOl . present in uniformly weighted ^i-norm regularized estimators 
such as (|7]l ifTT]. p. 92]. In the context of sparse signal reconstruction, the £o-norm of a vector was surrogated 
in Q by the logarithm of the geometric mean of its elements; see also ||28l . 

Building on this last idea, consider approximating ^ by the nonconvex formulation 

N 

min||X-ljvm'-SU'-0|||, + AoVlog(||o„||2 + <5), s. to U'U = (13) 
v,o ^-^ 

n=l 

where the small positive constant 6 is introduced to avoid numerical instability. Since the surrogate term 
in (fT3] ) is concave, the overall minimization problem is nonconvex and admittedly more complex to solve 
than dV]). Local methods based on iterative linearization of log(||o„||2 + (5) around the current iterate On{k), 
axe adopted to minimize (fT3] ). Skipping details that can be found in |[23l . application of the majorization- 
minimization technique to ([T3] ) leads to an iteratively-reweighted version of whereby A2 ^ XoWn{k) 
is used for updating o„(A:) in Algorithm [1] Specifically, per k = 1,2, . . . one updates 

0(A;) = 5 [X - iNm'ik - 1) - S(fc)U'(A;), (Ao/2)diag(u;i(A;), . . . , WN{k))] 

where the weights are given by Wn{k) = (||o„(A; — 1)||2 + 5)~^ , n = 1, . . . , N. Note that the thresholds 
vary both across rows (indexed by n), and across iterations. If the value of ||o„(A: — 1)||2 is small, then 
in the next iteration the regularization term AotL'n(A:)||o„||2 has a large weight, thus promoting shrinkage 
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of that entire row vector to zero. If ||o„(A: — 1)||2 is large, the cost in the next iteration downweighs the 
regularization, and places more importance to the LS component of the fit. 

All in all, the idea is to start from the solution of (|7]) for the 'best' A2, which is obtained using Algorithm 
[U This initial estimate is refined after runnning a few iterations of the iteratively-reweighted counterpart to 
Algorithm [2 Extensive numerical tests suggest that even a couple iterations of this second stage refinement 
suffices to yield improved estimates V, in comparison to those obtained from (|7]). The improvements can 
be leveraged to bias reduction - and its positive effect with regards to outlier support estimation - also 
achieved by similar weighted norm regularizers proposed for linear regression fTTI p. 92]. 

B. Automatic rank determination: from nuclear- to Frobenius-norm regularization 

Recall that q < p is the dimensionality of the subspace where the outlier-free data ([T]) are assumed to 
live in, or equivalently, q = rank[Y] in the absence of noise. So far, q was assumed known and fixed. 
This is reasonable in e.g., compression/quantization, where a target distortion-rate tradeoff dictates the 
maximum q. In other cases, the physics of the problem may render q known. This is indeed the case in 
array processing for direction-of-arrival estimation, where q is the dimensionality of the so-termed signal 
subspace, and is given by the number of plane waves impinging on a uniform linear array; see e.g., BOl . 

Other applications however, call for signal processing tools that can determine the 'best' q, as well as 
robustly estimate the underlying low-dimensional subspace U from data X. Noteworthy representatives 
for this last kind of problems include unveiling traffic volume anomalies in large-scale networks |[25l , 
and automatic intrusion detection from video surveillance frames Q, HI, just to name a few. A related 
approach in this context is (stable) principal components pursuit (PCP) |[38l . B2l . which solves 

mill ||X - L - Ofp + A*||L||* + A2IIOII2 r (14) 

L,0 

with the objective of reconstructing the low -rank matrix L G M^^*', as well as the sparse matrix of outliers 
O in the presence of dense noise with known variance^ Note that ||L||* denotes the matrix nuclear norm, 
defined as the sum of the singular values of L. The same way that the ^2-norm regularization promotes 
sparsity in the rows of O, the nuclear norm encourages a low-rank L since it effects sparsity in the 
vector of singular values of L. Upon solving the convex optimization problem ([T4l ). it is possible to obtain 
L = SU' using the SVD. Interestingly, (fT4l ) does not fix (or require the knowledge of) rank[L] a fortiori, 
but controls it through the tuning parameter A*. Adopting a Bayesian framework, a similar problem was 
considered in iQ- 



^Actually, |42| considers entrywise outliers and adopts an i'l-norm regularization on O. 
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Instead of assuming that q is known, suppose that only an upper bound q is given. Then, the class 
of feasible noise-free low-rank matrix components of Y in ([U admit a factorization L = SU', where S 
and U are N x q and p x q matrices, respectively. Building on the ideas used in the context of finding 
minimum rank solutions of Unear matrix equations |[29l . a novel alternative approach to robustifying PCA 
is to solve 

min ||X - SU' - 0\\l + ^(||U||^ + \\S\\l) + AallOHa,.. (15) 
u,s,o 2 MM. 

Different from (fT4l) and Q, a Frobenius-norm regularization on both U and S is adopted to control 

the dimensionality of the estimated subspace U. Relative to (I7]l, U in ( fTSl ) is not constrained to be 

orthonormal. It is certainly possible to include the mean vector m in the cost of ([TSl l. as well as an 

^i-norm regularization for entrywise outliers. The main motivation behind choosing the Frobenius-norm 

regularization comes from the equivalence of (fT4l ) with ([T5] ). as asserted in the ensuing result which 

adapts ||29l Lemma 5.1] to the problem formulation considered here. 

Lemma 1: If {L, O} minimizes (1141 ) and mnk\L\ < q, then (1141 ) and (1151 ) are equivalent. 

Proof: Because rank[L] < q, the relevant feasible subset of (fT4l) can be re-parametrized as {SU', O}, 
where S and U are N x q and p x q matrices, respectively. For every triplet {U, S, O} the objective of 
(flSl) is no smaller than the one of (fT4l) . since it holds that |[29ll 

||L||, =min^(||U||| + ||S|||,), s. to L = SU'. (16) 

One can show that the gap between the objectives of (fT4l ) and ( fTSl ) vanishes at O* := O, S* := U^S^/^, 
and U* := where L = U^SIV'^ is the SVD of L. Therefore, from the previous arguments it 

follows that ([T4l ) and ( fTSl ) attain the same global minimum objective, which completes the proof. ■ 

Even though problem ([TS] ) is nonconvex, the number of optimization variables is reduced from 2Np 
to Np + {N + p)q, which becomes significant when q is in the order of a few dozens and both N and 
p are large. Also note that the dominant A'^p-term in the variable count of ([TS] ) is due to O, which is 
sparse and can be efficiently handled. While the factorization L = SU' could have also been introduced 
in ([141 ) to reduce the number of unknowns, the cost in (fTSl ) is separable and much simpler to optimize 
using e.g., an AM solver comprising the iterations tabulated as Algorithm |2] The decomposability of the 
Frobenius-norm regularizer has been recently exploited for parallel processing across multiple processors 
when solving large-scale matrix completion problems ll30l . or to unveil network anomalies ll25l . 

Because ([TS] ) is a nonconvex optimization problem, most solvers one can think of will at most provide 
convergence guarantees to a stationary point that may not be globally optimum. Nevertheless, simulation 
results in Section I VII I demonstrate that Algorithm |2] is effective in providing good solutions most of 
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Algorithm 2 : Batch robust PC A solver with controllable rank 
Set 0(0) = Ojvxp, and randomly initialize S(0). 

for fc = 1,2, . . . do 

Update m(A:) [X - 0(fc - l)]'lAr/iV. 

Form Xo(fc) = X - lArm'(fc) - 0(fc - 1). 

Update U(fc) = Xo(fc)'S(fc - l)[S'(fc - l)S(fc - 1) + {K/2)Ig]-K 
Update S(fc) = Xo(fc)U(fc)[U'(fc)U(fc) + {X^/2)Ig]-\ 
Update 0(fc) = 5 [X - S(fc)U'(fc), A2/2] . 
end for 



the time, which is somehow expected since there is quite a bit of structure in (fTSl ). Formally, the next 
proposition adapted from |[25l Prop. 1] provides a sufficient condition under which Algorithm |2] yields an 
optimal solution of (fT4l) . For a proof of a slightly more general result, see [25 1. 

Proposition 3: If {U,S,0} is a stationary point of (fTSl) and ||X — SU' — 0||2 < A*/2, then {L := 
SU', O := 0} is the optimal solution of (114b . 

V. Robust Subspace Tracking 

E-commerce and Internet-based retailing sites, the World Wide Web, and video surveillance systems 
generate huge volumes of data, which far outweigh the ability of modern computers to analyze them 
in real time. Furthermore, data are generated sequentially in time, which motivates updating previously 
obtained learning results rather than re-computing new ones from scratch each time a new datum becomes 
available. This calls for low-complexity real-time (adaptive) algorithms for robust subspace tracking. 

One possible adaptive counterpart to ([7]) is the exponentially-weighted LS (EWLS) estimator found by 

N 

mill [||x„-m-Us„-o„||2 + A2||o„||2] (17) 

{v,o}^ '- 

where /3 G (0, 1] is a forgetting factor. In this context, n should be understood as a temporal variable, 
indexing the instants of data acquisition. Note that in forming the EWLS estimator ([TT] ) at time N, the 
entire history of data {:x.n}n=i is incorporated in the real-time estimation process. Whenever /5 < 1, 
past data are exponentially discarded thus enabling operation in nonstationary environments. Adaptive 
estimation of sparse signals has been considered in e.g., HI and |[24l . 

Towards deriving a real-time, computationally efficient, and recursive (approximate) solver of ( fTTl) . an 
AM scheme will be adopted in which iterations k coincide with the time scale n = 1,2,... of data 
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acquisition. Per time instant n, a new datum x„ is drawn and the corresponding pair of decision variables 
{s(ri),o(n)} are updated via 

{s(n), o(?i)} := arg min ||x„, — m(n — 1) — U(n — l)s — o||2 + A2II0II2. (18) 

{s,o} 

As per (ITSl) . only o(n) is updated at time n, rather than the whole (growing with time) matrix O that 
minimization of ([TT] ) would dictate; see also ll24l for a similar approximation. 

Because (fTSl ) is a smooth optimization problem w.r.t. s, from the first-order optimality condition the 
principal component update is s(n) = U'(n — l)[x„ — m(n — 1) — o(n)]. Interestingly, this resembles the 
projection approximation adopted in BOl . and can only be evaluated after o(n) is obtained. To this end, 
plug s{n) in ([TS] ) to obtain o{n) via a particular instance of the group Lasso estimator 

o(n) = arg min II [Ip - U(n - l)U'(n - 1)1 (x.„ - min - 1) - o)||2 + A2II0II2 (19) 

o 

with a single group of size equal to p. The cost in ( fT9l ) is non-differentiable at the origin, and different 
from e.g., ridge regression, it does not admit a closed-form solution. Upon defining 

H(n) := 2[Ip - U(n - l)U'(n - l)]'[Ip - U(n - l)U'(n - 1)] G (20) 

g(n) := - H(n)[x„ - m(n - 1)] G W (21) 

one can recognize ( [T9l ) as the multidimensional shrinkage-thresholding operator 7h(„) A2 (g('^)) introduced 
in II27I . In particular, as per ll27l Corollary 2] it follows that 

/ -(H(n)+7lp)-Mn), if ||g(n)||2>A2 
o(n) = 7ii(„),A2(g('^)) = < (22) 

I Op, otherwise 

where parameter 7 := A2/(277) is such that 77 > solves the scalar optimization 

min f 1 - g'(n) (2r/H(n) + A2) g(n)) r/. (23) 

Remarkably, one can easily determine if o(n) = Op, by forming g(?i) and checking whether ||g(n)||2 < A2. 
This will be the computational burden incurred to solve ([T9l ) for most n, since outliers are typically 
sporadic and one would expect to obtain o(n) = Op most of the time. When datum x„ is deemed an 
outlier, ||g(n)||2 > A2, and one needs to carry out the extra line search in (|23l ) to determine o{n) as per 
(I22I ): further details can be found in in ll27l . Whenever an ^i-norm outlier regularization is adopted, the 
resulting counterpart of ( fT9l ) can be solved using e.g., coordinate descent 0], or, the Lasso variant of 
least-angle regression (LARS) Il24l . 

Moving on, the subspace update is given by 



U(n) = *||xj — m(i — 1) — Us(2) — o( 



i=l 
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Algorithm 3 : Online robust (OR-)PCA 

\* Batch initialization phase 

Determine A2 and U(?7o) from {x„}J^2=i' in Section UlI-BI Initialize P{no) = lO'^Ip and s(no) = 0^. 

\* Online phase 

for n = no + 1 , no + 2, . . . do 

Form H(n) and g(n) using ( |20l ) and ( |2TI ). 

Update 0(71) = 7ii(„),A2(gH) via 

Update s(n) = U'(n — l)[x„ — o(n)]. 

\* RLS subspace update 

Update k(?i) = P{n - l)s{n)/[/3 + s' {n)P{n - l)s{n)]. 
Update P(n) = (l//3)[P(n - 1) - k(n)(P(n - l)s(n))']. 
Update U(n) = U(n - 1) + [x„ - U(7i - l)s(7i) - o(n)]k'(n). 
end for 



and can be efficiently obtained from U(n — 1), via a recursive LS update leveraging the matrix inversion 
lemma; see e.g., HOl . Note that the orthonormality constraint on U is not enforced here, yet the deviation 
from orthonormality is typically small as observed in POl . Still, if orthonormal principal directions are 
required, an extra orthonormaUzation step can be carried out per iteration, or, once at the end of the process. 
Finally, m(n) is obtained recursively as the exponentially-weighted average of the outlier-compensated 
data {xj — o(i)}"^]^. The resulting online robust (OR-)PCA algorithm and its initialization are summarized 
under Algorithm [3l where m and its update have been omitted for brevity. 

For the batch case where all data in 7^ are available for joint processing, two data-driven criteria to 
select A2 have been outlined in Section ITlI-BI However, none of these sparsity-controlling mechanisms can 
be run in real-time, and selecting A2 for subspace tracking via OR-PCA is challenging. One possibility to 
circumvent this problem is to select A2 once during a short initialization (batch) phase of OR-PCA, and 
retain its value for the subsequent time instants. Specifically, the initialization phase of OR-PCA entails 
solving Q using Algorithm [T] with a typically small batch of data {x„}^|L^. At time no, the criteria in 
Section Ull-B I are adopted to find the 'best' A2, and thus obtain the subspace estimate U(rio) required to 
initialize the OR-PCA iterations. 

Convergence analysis of OR-PCA algorithm is beyond the scope of the present paper, and is only 
confirmed via simulations. The numerical tests in Section IVIII also show that in the presence of outliers, 
the novel adaptive algorithm outperforms existing non-robust alternatives for subspace tracking. 
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VI. ROBUSTIFYING KERNEL PCA 

Kernel (K)PCA is a generalization to (linear) PCA, seeking principal components in a feature space 
nonlinearly related to the input space where the data in Tx live ||33|. KPCA has been shown effective 
in performing nonlinear feature extraction for pattern recognition [33 1. In addition, connections between 
KPCA and spectral clustering ifTTl p. 548] motivate well the novel KPCA method developed in this section, 
to robustly identify cohesive subgroups (communities) from social network data. 

Consider a nonlinear function cf) : W ^ T-L, that maps elements from the input space W to a 
feature space T-L of arbitrarily large - possibly infinite - dimensionality. Given transformed data Th := 
{(t){'x.n)}n=i' proposed approach to robust KPCA fits the model 

0(x„) = m + Us„ + e.„ + o„, n=l,...,iV (24) 

by solving (* := [0(xi), . . . , 0(xAr)]) 

mill II*' - IW - SU' - Ofp + ^(||U||| + ||S||?.) + A2||0||2r. (25) 

U,S,0 2 MM: 

It is certainly possible to adopt the criterion ([T]) as well, but (l25l) is chosen here for simplicity in exposition. 
Except for the principal components' matrix S G M^^'', both the data and the unknowns in (1251 ) are now 
vectors/matrices of generally infinite dimension. In principle, this challenges the optimization task since 
it is impossible to store, or, perform updates of such quantities directly. For these reasons, assuming zero- 
mean data 0(x„), or, the possibility of mean compensation for that matter, cannot be taken for granted 
here [cf. Remark [H. Thus, it is important to explicitly consider the estimation of m. 

Interestingly, this hurdle can be overcome by endowing T-L with the structure of a reproducing kernel 
Hilbert space (RKHS), where inner products between any two members of T-L boil down to evaluations 
of the reproducing kernel K-^ : x RP — M, i.e., (0(xj), (/)(xj))-^ = K-}{{-x.i,Di.j). Specifically, it is 
possible to form the kernel matrix K := G R^^^, without directly working with the vectors in T-L. 
This so-termed kernel trick is the crux of most kernel methods in machine learning ifTTI . including kernel 
PCA ll33l . The problem of selecting K'^^ (and indirectly) will not be considered here. 

Building on these ideas, it is shown in the sequel that Algorithm |2] can be kernelized, to solve (l25l) 
at affordable computational complexity and memory storage requirements that do not depend on the 
dimensionality of H. 

Proposition 4: For k > 1, the sequence of iterates generated by Algorithm |2] when applied to solve (1251 ) 
can be written as m{k) = U(A;) = <^t{k), and 0'{k) = <^Vl{k). The quantities fj,{k) G M^, 

r{k) G M^^"?, and n{k) G M^^^ are recursively updated as in Algorithm |?] without the need of 
operating with vectors in %. 
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Proof: The proof relies on an inductive argument. Suppose that at iteration k — \, there exists a matrix 
^{k — 1) G M^^^ such that the outliers can be expressed as 0'(A; — 1) = ^Vl{k — 1). From Algorithm 
H the update for the mean vector is m(/c) = - 0(/c - l)]'lAr/iV = - ^n{k - 1)]1n /N = ^^{k) 
where fx{k) := [I„ - ^(A; - 1)]1n/N. Likewise, Xo{k) = - lNtJ.'{k)^' - il'{k - 1)*' so that one 
can write the subspace update as U(/c) = $T(A;), upon defining 

T(A;) := [I,v - t^ik)!'^ - ^ik - l)]S{k - l)[S'{k - l)S{k - 1) + {K/2)Ig]~\ 

With regards to the principal components, it follows that (cf. Algorithm ^ 

S{k) = [In - iNf^'ik) - ft'{k - l)]*'*T(fc)[T(A;)'*'*T(A:) + {X,/2)Ig]-^ 

= [In - iNf^'ik) - ft'ik - l)]KT(fe)[T(fe)'KT(fc) + {X,/2)Ig]-^ (26) 

which is expressible in terms of the kernel matrix K := Finally, the columns o„(A;) are given by 

the vector soft-thresholding operation ([TOl l. where the residuals are 

Vnik) = 0(x„) - m{k) - U(/c)s„(/c) = *[bAr,„ - n{k) - T(fc)s„(A;)] := ^pnik). 

Upon stacking all columns On{k), n = 1, . . . ,N, one readily obtains [cf. ([T0l )1 

o'{k) = ^[In - Kk)i'N - rik)s'{k)]A{k) ill) 

where A(A;) := diag((||ri(A;)||2-A2/2)+/||ri(A:)||2, . . . , (||r^(A;)||2-A2/2)+/||r;v(A:)||2)- Interestingly, the 
diagonal elements of A(A;) can be computed using the kernel matrix, since ||r„(/c)||2 = ^ p^(A;)Kp„(A;), 
n = 1, . . . , N . From (|27] ) it is apparent that one can write 0'(A;) = ^^l{k), after defining 

n{k) := [In - ti{k)l'N - r{k)S'{k)]A{k). 

The proof is concluded by noting that for k = 0, Algorithm |2] is initialized with O'(0) = OpxA^- One can 
thus satisfy the inductive base case O'(0) = $0(0), by letting Cl{0) = OnxN- ■ 

In order to run the novel robust KPCA algorithm (tabulated as Algorithm lU, one does not have to 
store or process the quantities m(/c), U(A;), and 0{k). As per Proposition IH the iterations of the provably 
convergent AM solver in Section ITV-B I c an be equivalently carried out by cycling through finite-dimensional 
'sufficient statistics' ^i{k) — T(A;) — S(A;) — )■ Q{k). In other words, the iterations of the robust kernel 
PC A algorithm are devoid of algebraic operations among vectors in T-L. Recall that the size of matrix S 
is independent of the dimensionality of T-L. Nevertheless, its update in Algorithm |2] cannot be carried out 
verbatim in the high-dimensional setting here, and is instead kernelized to yield the update rule ( [261) . 

Because 0'{k) = ^Q,{k) and upon convergence of the algorithm, the outlier vector norms are com- 
putable in terms of K, i.e., [||oi(oo)||2, • • • , ||oAr(oo)||2]' = diag[i7'(oo)Ki7(oo)]. These are critical to 
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Algorithm 4 : Robust KPCA solver 

Initialize r2(0) = O^xn, S(0) randomly, and form K = ^>'*. 

for fc = 1,2, . . . do 

Update n{k) = [I„ - 0(fc - l)]ljv/^- 

Form = In- fJ-ik)!']^ - n(k - 1). 

Update r{k) = *o(fc)S(/fc - l)[S'(fc - l)S{k - 1) + (X,/2)Ig]-\ 
Update S(fc) = *;,(fc)KY(fc)[Y(fc)'KT(A:) + {X,/2)Ig]-\ 
Form Pn{k) ~ „ — /x(fc) — Y(fc)s„(A:), = 1, . . . ,N, and update A(fc). 
Update n{k) = [In - nik)!^ - T(fc)S'(fc)] A(fc). 
end for 



determine the robustification paths needed to carry out the outlier sparsity control methods in Section 
IIII-BI Moreover, the principal component corresponding to any given new data point x is obtained through 
the projection s = U(oo)'[0(x) — m(oo)] = T'(oo)$'0(x) — T'(oo)K/x(oo), which is again computable 
after N evaluations the kernel function Ky^. 

VII. Numerical Tests 

A. Synthetic data tests 

To corroborate the effectiveness of the proposed robust methods, experiments with computer gener- 
ated data are carried out first. These are important since they provide a 'ground truth', against which 
performance can be assessed by evaluating suitable figures of merit. 

Outlier-sparsity control. To generate the data dill, a similar setting as in B2l Sec. V] is considered 
here with N = p and m = Op. For n = 1,...,N, the errors ore ^ ( Op , (Tg Ip ) (multivariate 
normal distribution) and i.i.d. The entries of U and {sn}n=i i-i-d- zero-mean Gaussian distributed, 
with variance crfj^ = lOcJe/V^- Outliers are generated as o„ = p„ q„, where the entries of p„ are 
i.i.d. Bernoulli distributed with parameter pp, and q„ has i.i.d. entries drawn from a uniform distribution 
supported on [—5, 5]. The chosen values of the parameters are N = p = 200, q = 20, pp = 0.01, and 
varying noise levels = {0.01,0.05,0.1,0.25,0.5}. 

In this setup, the ability to recover the low-rank component of the data L := SU' is tested for the 
sparsity-controUing robust PCA method of this paper [cf. dV])], stable PCP (fT4l ). and (non-robust) PCA. 
The ^i-norm regularized counterparts of (|7]l and (fT4b are adopted to deal with entry-wise outliers. Both 
values of q and cTg are assumed known to obtain L := SU' and O via ([7]). This way, A2 is chosen using the 
sparsity-controUing algorithm of Section Ull-B [ searching over a grid where Gx = 200, Amin = 10~^Amax, 
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and Aniax = 20. In addition, the solutions of ^ are refined by running two iterations of the iteratively 
reweighted algorithm in Section IIV-AI where 6 = 10~^. Regarding SPCP, only the knowledge of cjg is 
required to select the tuning parameters A* = 2\j2Na\ and A2 = 2\/2a1 in ([T4l) . as suggested in H2l . 
Finally, the best rank q approximation to the data X is obtained using standard PCA. 

The results are summarized in Table IH which shows the estimation errors err := ||L — LUi^/A^ attained 
by the aforementioned schemes, averaged over 15 runs of the experiment. The 'best' tuning parameters 
A2 used in (|7]l are also shown. Both robust schemes attain an error which is approximately an order 
of magnitude smaller than PCA. With the additional knowledge of the true data rank q, the sparsity- 
controUing algorithm of this paper outperforms stable PCP in terms of err. This numerical test is used to 
validate Proposition [3] as well. For the same values of the tuning parameters chosen for ([T4l ) and the rank 
upper-bound set to q = 2q, Algorithm |2] is run to obtain the solution {U, S, 0} of the nonconvex problem 
(fTSl ). The average (across realizations and values of fig) errors obtained are ||L — SU'lli^/A^ = 0.15 x 10~^ 
and ||6-0||F/iV = 0.78 x lO"'^, where {L, 6} is the solution of stable PCP [cf. d)]. Thus, the solutions 
are identical for all practical purposes. 

Identification of invalid survey protocols. Robust PCA is tested here to identify invalid or otherwise 
aberrant item response (questionnaire) data in surveys, that is, to flag and hold in abeyance data that may 
negatively influence (i.e., bias) subsequent data summaries and statistical analyses. In recent years, item 
response theory (IRT) has become the dominant paradigm for constructing and evaluating questionnaires 
in the biobehavioral and health sciences and in high-stakes testing (e.g., in the development of college 
admission tests); see e.g., |[37l . IRT entails a class of nonlinear models characterizing an individual's item 
response behavior by one or more latent traits, and one or more item parameters. An increasingly popular 
IRT model for survey data is the 2-parameter logistic IRT model (2PLM) |[3T| . 2PLM characterizes the 
probability of a keyed (endorsed) response ynm, as a nonlinear function of a weighted difference between 
a person parameter 6'„ and an item parameter bm 

gl.7a„(6»„-b„.) 

Pr(y„^ = 1|^„) = (28) 

where 9n is a latent trait value for individual n; Om is an item discrimination parameter (similar to a factor 
loading) for item m; and bm is an item difficulty (or extremity) parameter for item m. 

Binary item responses ('agree/disagree' response format) were generated for N = 1,000 hypothetical 
subjects who were administered p = 200 items (questions). The 2PLM function ( |28] ) was used to generate 
the underlying item response probabilities, which were converted into binary item responses as follows: a 
response was coded 1 if Pi:{ynm\0n) > ^(0, 1), and coded otherwise, where U[0, 1] denotes a uniform 
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random deviate over [0, 1]. Model parameters were randomly drawn as {am}^£i ~ 1.5], {bm}m=i ~ 
U[-2,2], and {Oi}f^^^ ~ AA(05,l5). Each of the 200 items loaded on one of g = 5 latent factors. To 
simulate random responding - a prevalent form of aberrancy in e.g., web-collected data - rows 101- 
120 of the item response matrix Y were modified by (re)drawing each of the entries from a BernoulU 
distribution with parameter 0.5, thus yielding the corrupted matrix X. 

Robust PCA in ^ was adopted to identify invalid survey data, with q = 5, and A2 chosen such that 
||0||o = 150, a safe overestimate of the number of outliers. Results of this study are summarized in Fig.[T] 
which displays the 100 largest outliers (||6„||2) from the robust PCA analysis of the = 1, 000 simulated 
response vectors. When the outliers are plotted against their ranks, there is an unmistakable break between 
the 20th and 21st ordered value indicating that the method correctly identified the number of aberrant 
response patterns in X. Perhaps more impressively, the method also correctly identified rows 101 -to- 120 
as containing the invalid data. 

Online robust subspace estimation. A simulated test is carried out here to corroborate the convergence 
and effectiveness of the OR-PCA algorithm in Section jV] For N = 2,000, p = 150, and (7 = 5, nominal 
data in Ty are generated according to the stationary model ([T|l, where e„ ~ A/'(Op, 10~^Ip). Vectors 
xiooi, • • • , X1005 are outliers, uniformly i.i.d. over [—0.5, 0.5]. The results depicted in Fig. |2] are obtained 
after averaging over 50 runs. Fig. |2] (left) depicts the time evolution of the angle between the learnt 
subspace (spanned by the columns of) U(n) and the true subspace U generating Ty, where A2 = 1.65 
and (3 = 0.99. The convergent trend of Algorithm |3] to U is apparent; and markedly outperforms the 
non-robust subspace tracking method in POl . and the first-order GROUSE algorithm in 121. Note that 
even though U is time-invariant, it is meaningful to select ^ /3 < 1 to quickly 'forget' and recover 
from the outliers. A similar trend can be observed in Fig. |2] (right), which depicts the time evolution of 
the reconstruction error ||yn — U(^)U(n)'yn||2/P- 

Robust spectral clustering. The following simulated test demonstrates that robust KPCA in Section 
IVll can be effectively used to robustify spectral clustering (cf. the connection between both non-robust 
methods in e.g., ||T7l p. 548]). Adopting the data setting from HT] p. 546]), N = 450 points in 
are generated from three circular concentric clusters, with respective radii of 1, 2.8, and 5. The points 
are uniformly distributed in angle, and additive noise ~ A/'(02i O.I5I2) is added to each datum. Five 
outliers {x„}^^45^ uniformly distributed in the square [—7,7]^ complete the training data 7^; see Fig. |3] 
(left). To unveil the cluster structure from the data. Algorithm |4] is run using the Gaussian radial kernel 
ir(xj,Xj) = exp(— ||xj — Xj||2/c), with c = 10. The sparsity-controUing parameter is set to A2 = 1.85 
so that ||0||o = 5, while = 1, and q = 2. Upon convergence, the vector of estimated outlier norms is 
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[||oi(oo)|||, . . . , ||o7v+5(oo)||i]' = [0, ... ,0, 10"^, 1.3 X 10-3, 1.5 x lO'^, 10-^ 1.7 x 10"^]', which shows 
that the outliers are correctly identified. Estimates of the (rotated) first two dominant eigenvectors of the 
kernel matrix K are obtained as the columns of T, and are depicted in Fig. [3] (right). After removing 
the rows of T corresponding to the outliers [black points in Fig. [3] (right)], e.g., K-means clustering of 
the remaining points in Fig. |3] (right) will easily reveal the three clusters sought. From Fig. |3] (right) it is 
apparent that a non-robust KPCA method will incorrectly assign the outliers to the outer (green) cluster. 

B. Real data tests 

Video surveillance. To validate the proposed approach to robust PCA, Algorithm [T] was tested to perform 
background modeling from a sequence of video frames; an approach that has found widespread applica- 
bility for intrusion detection in video surveillance systems. The experiments were carried out using the 
dataset studied in HI, which consists of = 520 images (p = 120 x 160) acquired from a static camera 
during two days. The illumination changes considerably over the two day span, while approximately 40% 
of the training images contain people in various locations. For q = 10, both standard PCA and the robust 
PCA of Section |lll] were applied to build a low-rank background model of the environment captured by 
the camera. For robust PCA, ^i-norm regularization on O was adopted to identify outliers at a pixel level. 
The outlier sparsity-controUing parameter was chosen as A2 = 9.69 x 10^'^, whereas a single iteration of 
the reweighted scheme in Section IIV-AI was run to reduce the bias in O. 

Results are shown in Fig. [TJ for three representative images. The first column comprises the original 
frames from the training set, while the second column shows the corresponding PCA image reconstructions. 
The presence of undesirable 'ghostly' artifacts is apparent, since PCA is unable to completely separate the 
people from the background. The third column illustrates the robust PCA reconstructions, which recover 
the illumination changes while successfully subtracting the people. The fourth column shows the reshaped 
outlier vectors 6„, which mostly capture the people and abrupt changes in illumination. 
Robust measurement of the Big Five personality factors. The 'Big Five' are five factors {q = 5) of 
personality traits, namely extraversion, agreeableness, conscientiousness, neuroticism, and openness; see 
e.g., lIlTI . The Big Five inventory (BFI) on the other hand, is a brief questionnaire (44 items in total) 
tailored to measure the Big Five dimensions. Subjects taking the questionnaire are asked to rate in a 
scale from 1 (disagree strongly) to 5 (agree strongly), items of the form 'I see myself as someone who is 
talkative'. Each item consists of a short phrase correlating (positively or negatively) with one factor; see 
e.g., Il2n pp. 157-58] for a copy of the BFI and scoring instructions. 

Robust PCA is used to identify aberrant responses from real BFI data comprising the Eugene-Springfield 



IEEE TRANSACTIONS ON SIGNAL PROCESSING (SUBMITTED) 



23 



community sample |[T6l . The rows of X contain the p = 44 item responses for each one of the = 437 
subjects under study. For g = 5, (O is solved over grid of G\ = 200 values of A2, where Amin = 10~^Amax> 
and Amax = 20. The first plot of Fig. [5] (left) shows the evolution of O's row support as a function of 
A2 with black pixels along the nth row indicating that ||6„,||2 = 0, and white ones reflecting that the 
responses from subject n are deemed as outliers for the given A2. For example subjects n = 418 and 
204 are strong outlier candidates due to random responding, since they enter the model (||6„||2 > 0) for 
relatively large values of A2. The responses of e.g., subjects n = 63 (all items rated '3') and 249 (41 items 
rated '3' and 3 items rated '4') are also undesirable, but are well modeled by ([1]) and are only deemed 
as outliers when A2 is quite small. These two observations are corroborated by the second plot of Fig. [5] 
(left), which shows the robust PCA results on a corrupted dataset, obtained from X by overwriting: (i) 
rows 151 — 160 with random item responses drawn from a uniform distribution over {1,2,3,4,5}; and 
(ii) rows 301 — 310 with constant item responses of value 3. 

For A2 = 5.6107 corresponding to ||0||o = 100, Fig. [5] (right) depicts the norm of the 40 largest 
outliers. Following the methodology outlined in Section [VII-AI 8 subjects including n = 418 and 204 are 
declared as outliers by robust PCA. As a means of validating these results, the following procedure is 
adopted. Based on the BFI scoring key f2T\. a list of all pairs of items hypothesized to yield positively 
correlated responses is formed. For each n, one counts the 'inconsistencies' defined as the number of 
times that subject n's ratings for these pairs differ in more than four, in absolute value. Interestingly, after 
rank-ordering all subjects in terms of this inconsistency score, one finds that n = 418 ranks highest with 
a count of 17, n = 204 ranks second (10), and overall the eight outliers found rank in the top twenty. 
Unveiling communities in social networks. Next, robust KPCA is used to identify communities and 
outliers in a network of = 115 college football teams, by capitalizing on the connection between 
KPCA and spectral clustering [iTj, p. 548]. Nodes in the network graph represent teams belonging to 
eleven conferences (plus five independent teams), whereas (unweighted) edges joining pairs of nodes 
indicate that both teams played against each other during the Fall 2000 Division I season |[T5l . The kernel 
matrix used to run robust KPCA is K = CIjv + D~^/^AD~^/^, where A and D denote the graph 
adjacency and degree matrices, respectively; while C > is chosen to render K positive semi-definite. 
The tuning parameters are chosen as A2 = 1.297 so that ||0||o = 10, while A* = 1, and q = 3. Fig. [6] 
(left) shows the entries of K, where rows and columns are permuted to reveal the clustering structure 
found by robust KPCA (after removing the outliers); see also Fig. [6] (right). The quality of the clustering 
is assessed through the adjusted rand index (ART) after excluding outliers lfT2ll . which yielded the value 
0.8967. Four of the teams deemed as outliers are Connecticut, Central Florida, Navy, and Notre Dame, 
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which are indeed teams not belonging to any major conference. The community structure of traditional 
powerhouse conferences such as Big Ten, Big 12, ACC, Big East, and SEC was identified exactly. 

VIII. Concluding Summary 

Outlier-robust PCA methods were developed in this paper, to obtain low-dimensional representations of 
(corrupted) data. Bringing together the seemingly unrelated fields of robust statistics and sparse regression, 
the novel robust PCA framework was found rooted at the crossroads of outlier-resilient estimation, learning 
via (group-) Lasso and kernel methods, and real-time adaptive signal processing. Social network analysis, 
video surveillance, and psychometric s, were highlighted as relevant application domains. 

Acknowledgment: The authors would like to thank Prof. Niels Waller (Department of Psychology, 
University of Minnesota) for the fruitful discussions on IRT and the measurement of the Big Five; and 
Dr. Lewis Goldberg (Oregon Research Institute) for facilitating access to the BFI data studied in Section 
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Appendix 

Towards establishing the equivalence between problems ^ and ([U), consider the pair {V, O} that 
solves Q. Assume that V is given, and the goal is to determine O. Upon defining the residuals r„ := 
x„ — rh — Us„ and from the row-wise decomposability of || • ||2,r> the rows of O are separately given by 

6„ := arg mill [||r„ - o„||^ + A2||o„||2l , n = l,...,N. (29) 

For each n = 1, . . . , N, because ( |29l ) is nondifferentiable at the origin one should consider two cases: i) 
if On = Op, it follows that the minimum cost in ( |29l ) is Urnlll; otherwise, ii) if ||6„||2 > 0, the first-order 
condition for optimality gives 6„ = f „ — (A2/2)r„/||r„||2 provided ||rn||2 > A2/2, and the minimum cost 
is A2||rn||2 — ^2/^- Compactly, the solution of ( [29l ) is given by 6„ = r„(||r„||2 — A2/2)+/||f„||2 , while 
the minimum cost in ( [29l ) after minimizing w.r.t. o„ is Pt,(r„) [cf. ^ and the argument following ([29l)1. 
The conclusion is that V is the minimizer of ([8]l, in addition to being the solution of (|7]l by definition. 



IEEE TRANSACTIONS ON SIGNAL PROCESSING (SUBMITTED) 



25 



References 

[1] D. Angelosante, J. A. Bazerque, and G. B. Giannakis, "Online adaptive estimation of sparse signals: Where RLS meets the 

^i-norm," IEEE Trans. Signal Process, vol. 58, pp. 3436-3447, Jul. 2010. 
[2] L. Balzano, R. Nowak, and B. Recht, "Online identification and tracking of subspaces from highly incomplete information," 

in Proc. of 48th AUerton Conference, Monticello, IL, Sep./Oct. 2010, pp. 704-711. 
[3] D. R Bertsekas, Nonlinear Programming, 2nd ed. Athena-Scientific, 1999. 

[4] N. A. Campbell, "Robust procedures in multivariate analysis i: Robust covariance estimation," Applied Stat., vol. 29, pp. 
231-237, 1980. 

[5] E. J. Candes, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?" Journal of the ACM, vol. 58, Article 
No. 11, Mar. 2011. 

[6] E. J. Candes, M. B. Wakin, and S. Boyd, "Enhancing sparsity by reweighted l\ minimzation," Journal of Fourier Analysis 

and Applications, vol. 14, pp. 877-905, Dec. 2008. 
[7] V. Chandrasekaran, S. Sanghavi, R A. Parillo, and A. S. Willsky, "Rank- sparsity incoherence for matrix decomposition," 

SIAM Journal on Optimization, vol. 21, pp. 572-596, 2011. 
[8] F. de la Torre and M. J. Black, "A framework for robust subspace learning," Int. Jrnl. of Computer Vision, vol. 54, pp. 

1183-209, 2003. 

[9] X. Ding, L. He, and L. Carin, "Bayesian robust principal component analysis," IEEE Trans. Image Process., vol. 20, 2011. 
[10] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," / Amer Stat. Assoc., 
vol. 96, pp. 1348-1360, 2001. 

[11] M. Fazel, H. Hindi, and S. Boyd, "Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean 

distance matrices," in Proc. of the American Control Conf, Denver, CO, Jun. 2003, pp. 2156-2162. 
[12] R Forero, V. Kekatos, and G. B. Giannakis, "Outlier-aware robust clustering," in Proc. of Intl. Conf. on Acoustics, Speech 

and Signal Processing, Prague, Czech Republic, May 2011, pp. 2244-2247. 
[13] J. J. Fuchs, "An inverse problem approach to robust regression," in Proc. of Intl. Conf. on Acoustics, Speech and Signal 

Processing, Phoeniz, AZ, Mar. 1999, pp. 180-188. 
[14] G. B. Giannakis, G. Mateos, S. Farahmand, V. Kekatos, and H. Zhu, "USPACOR: Universal sparsity-controlling outlier 

rejection," in Proc. of Intl. Conf. on Acoust., Speech and Signal Proc, Prague, Czech Republic, May 2011, pp. 1952-1955. 
[15] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," Proc. Natl. Acad. Sci. USA, 

vol. 99, pp. 7821-7826, 2002. 

[16] L. R. Goldberg, "The Eugene-Springfield community sample: Information available from the research participants," Oregon 

Research Institue, Tech. Rep. vol. 48, no. 1, 2008. 
[17] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed. Springer, 2009. 
[18] J. He, L. Balzano, and J. C. S. Lui, "Online robust subspace tracking from partial information," 2011, see also 

arXiv:1109.3827v2 [cs.IT]. 
[19] P. J. Huber and E. Ronchetti, Robust Statistics. New York: Wiley, 2009. 

[20] Y. Jin and B. D. Rao, "Algorithms for robust linear regression by exploiting the connection to sparse signal recovery," in 
Proc. of Intl. Conf. on Acoustics, Speech and Signal Processing, Dallas, TX, Mar. 2010, pp. 3830-3833. 

[21] O. P. John, L. P. Naumann, and C. J. Soto, "Paradigm shift to the integrative big-five trait taxonomy: History, measurement, 
and conceptual issues," in Handbook of personality: Theory and research, O. P. John, R. W. Robins, and L. A. Pervin, Eds. 
New York, NY: Guilford Press, 2008. 



IEEE TRANSACTIONS ON SIGNAL PROCESSING (SUBMITTED) 



26 



[22] I. T. JoUiffe, Principal Component Analysis. New York: Springer, 2002. 

[23] V. Kekatos and G. B. Giannakis, "From sparse signals to sparse residuals for robust sensing," IEEE Trans, on Signal 

Proces.sing, vol. 59, pp. 3355-3368, Jul. 2011. 
[24] J. Mairal, J. Bach, J. Ponce, and G. Sapiro, "Online learning for matrix factorization and sparse coding," Jrnl. of Machine 

Learning Research, vol. 11, pp. 19-60, Jan. 2010. 
[25] M. Mardani, G. Mateos, and G. B. Giannakis, "Unveiling network anomalies in large-scale networks via sparsity and low 

rank," in Proc. of 44th Asilomar Conf. on Signals, Systems, and Computers, Pacific Grove, CA, Nov. 2011. 
[26] B. K. Natarajan, "Sparse approximate solutions to linear systems," SIAM J. Compiit., vol. 24, pp. 227-234, 1995. 
[27] A. T. Puig, A. Wiesel, and A. O. Hero, "Multidimensional shrinkage-thresholding operator and group LASSO penalties," 

IEEE Signal Process. Letters, vol. 18, pp. 363-366, Jun. 2011. 
[28] I. Ramirez, F. Lecumberry, and G. Sapiro, "Universal priors for sparse modeling," in Proc. of 3rd Intl. Workshop on Comp. 

Advances in Multi-Sensor Adapt. Process., Aruba, Dutch Antilles, Dec. 2009, pp. 197-200. 
[29] B. Recht, M. Fazel, and P. A. Parrilo, "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm 

minimization," SIAM Rev., vol. 52, pp. 471-501, 2010. 
[30] B. Recht and C. Re, "Parallel stochastic gradient algorithms for large-scale matrix completion," 2011, (submitted). 

[Online]. Available: |http://pages.cs.wisc.edu/~brecht/papers/l 1 .Rec.Re.IPGM.pdf | 
[31] S. P. Reise and N. G. Waller, "Traitedness and the assessment of response pattern scalability," Journal of Personality and 

Social Psychology, vol. 65, pp. 143-151, 1993. 
[32] P. J. Rousseeuw and A. M. Leroy, Robust regression and outlier detection. New York: Wiley, 1987. 
[33] B. Schlkopf, A. Smola, and K.-R. Mller, "Kernel principal component analysis," Artificial Neural Networks: Lec. Notes in 

Computer Science, vol. 1327, pp. 583-588, 1997. 
[34] I. Tosic and P. Frossard, "Dictionary learning," IEEE Signal Process. Mag., vol. 28, pp. 27-38, Mar. 2010. 
[35] J. Tropp, "Just relax: Convex programming methods for identifying sparse signals," IEEE Trans, on Information Theory, 

vol. 51, pp. 1030-1051, Mar. 2006. 
[36] P. Tseng, "Convergence of block coordinate descent method for nondifferentiable maximization," J. Optim. Theory AppL, 

vol. 109, pp. 473^92, 2001. 

[37] N. Waller and S. Reise, "Measuring psychopathology with non-standard IRT models: Fitting the four parameter model to the 

MMPI," in New Directions in Psychological Measurement with Model-Based Approaches, S. Embretson, Ed. Washington, 

DC: Amer. Psych. Assoc., 2010. 
[38] H. Xu, C. Caramanis, and S. Sanghavi, "Robust PCA via outlier pursuit," 2010, see also arXiv: 1010.4237v2 [cs.LG]. 
[39] L. Xu and A. L. Yuille, "Robust principal component analysis by self-organizing rules based on statistical physics approach," 

IEEE Trans. Neural Nets., vol. 6, pp. 131-143, Jan. 1995. 
[40] B. Yang, "Projection approximation subspace tracking," IEEE Trans. Sig. Proc, vol. 43, pp. 95-107, Jan. 1995. 
[41] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," J. Royal. Statist. Soc B, vol. 68, 

pp. 49-67, 2006. 

[42] Z. Zhou, X. Li, J. Wright, E. Candes, and Y. Ma, "Stable principal component pursuit," in Proc. of Intl. Symp. on Information 

Theory, Austin, TX, Jun. 2010, pp. 1518-1522. 
[43] H. Zou, T. Hastie, and R. Tibshirani, "Sparse principal component analysis," Jrnl. of Comp. and Graphical Statistics, vol. 15, 

no. 2, pp. 265-286, 2006. 



IEEE TRANSACTIONS ON SIGNAL PROCESSING (SUBMITTED) 



27 



TABLE I 
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Fig. 1. Pseudo scree plot of outlier size (||6n||2); the 100 largest outliers are shown. 



Subjects 101-120 had random 
item response patters 




Fig. 2. (Left) Time evolution of the angle between the learnt subspace U(n), and the true U used to generate the data (/? = 0.99 
and A2 = 1.65). Outlier contaminated data is introduced at time n = 1001. (Right) Time evolution of the reconstruction error. 
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Fig. 3. (Left) Data in ttiree concentric clusters, in addition to five outliers shown in black. (Right) Coordinates of the first two 
columns of T, obtained by running Algorithm ID The five outlying points are correctly identified, and thus can be discarded. 
Non-robust methods will assign them to the green cluster. 
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Fig. 4. Background modeling for video surveillance. First column: original frames. Second column: PCA reconstructions, where 
the presence of undesirable 'ghostly' artifacts is apparent, since PCA is not able to completely separate the people from the 
background. Third column: robust PCA reconstructions, which recover the illumination changes while successfully subtracting 
the people. Fourth column: outliers in 6, which mostly capture the people and abrupt changes in illumination. 
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Increasing index Increasing \^ index 



Fig. 5. (Left) Evolution of O's row support as a function of A2 - black pixels along the nth row indicate that ||6„||2 = 0, 
whereas white ones reflect that the responses from subject n are deemed as outliers for given A2. The results for both the original 
and modified (introducing random and constant item responses) BFI datasets are shown. (Right) Pseudo scree plot of outlier size 
(||6n||2); the 40 largest outliers are shown. Robust PCA declares the largest 8 as aberrant responses. 




Fig. 6. (Left) Entries of K after removing the outliers, where rows and columns are permuted to reveal the clustering structure 
found by robust KPCA. (Right) Graph depiction of the clustered network. Teams belonging to the same estimated conference 
(cluster) are colored identically. The outliers are represented as diamond-shaped nodes. 



